Weakly Supervised Morphology Learning for Agglutinating Languages Using Small Training Sets
نویسندگان
چکیده
The paper describes a weakly supervised approach for decomposing words into all morphemes: stems, prefixes and suffixes, using wordforms with marked stems as training data. As we concentrate on under-resourced languages, the amount of training data is limited and we need some amount of supervision in the form of a small number of wordforms with marked stems. In the first stage we introduce a new Supervised Stem Extraction algorithm (SSE). Once stems have been extracted, an improved unsupervised segmentation algorithm GBUMS (GraphBased Unsupervised Morpheme Segmentation) is used to segment suffix or prefix sequences into individual suffixes and prefixes. The approach, experimentally validated on Turkish and isiZulu languages, gives high performance on test data and is comparable to a fully supervised method.
منابع مشابه
Multilingual Noise-Robust Supervised Morphological Analysis using the WordFrame Model
This paper presents the WordFrame model, a noiserobust supervised algorithm capable of inducing morphological analyses for languages which exhibit prefixation, suffixation, and internal vowel shifts. In combination with a näive approach to suffix-based morphology, this algorithm is shown to be remarkably effective across a broad range of languages, including those exhibiting infixation and part...
متن کاملLabeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods
In this paper we consider the problem of labeling the languages of words in mixed-language documents. This problem is approached in a weakly supervised fashion, as a sequence labeling problem with monolingual text samples for training data. Among the approaches evaluated, a conditional random field model trained with generalized expectation criteria was the most accurate and performed consisten...
متن کاملComposite Kernel Optimization in Semi-Supervised Metric
Machine-learning solutions to classification, clustering and matching problems critically depend on the adopted metric, which in the past was selected heuristically. In the last decade, it has been demonstrated that an appropriate metric can be learnt from data, resulting in superior performance as compared with traditional metrics. This has recently stimulated a considerable interest in the to...
متن کاملLanguage Segmentation
Language segmentation consists in finding the boundaries where one language ends and another language begins in a text wrien in more than one language. is is important for all natural language processing tasks. e problem can be solved by training language models on language data. However, in the case of lowor no-resource languages, this is problematic. I therefore investigate whether unsuper...
متن کاملSupervised Morphological Segmentation in a Low-Resource Learning Setting using Conditional Random Fields
We discuss data-driven morphological segmentation, in which word forms are segmented into morphs, the surface forms of morphemes. Our focus is on a lowresource learning setting, in which only a small amount of annotated word forms are available for model training, while unannotated word forms are available in abundance. The current state-of-art methods 1) exploit both the annotated and unannota...
متن کامل